As shown in Figure 5.2, the alignment statistics for one of the BAM files include the total
number of reads, the average read length, the number and percentage of uniquely mapped
reads, splice statistics, and statistics for the reads mapped to multiple loci, the
unmapped reads, and the chimeric reads. Pay attention to the reads mapped to multiple loci
and to the chimeric reads; a large proportion of either suggests low-quality alignment. Remember
that this BAM file includes the alignments to chromosome 22 only. The number of reads
would be far larger if the BAM file contained the alignments to all chromosomes.
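The percentages worth checking can be pulled out of a STAR summary log with a short shell function. The sketch below assumes the log follows the usual layout of STAR's `Log.final.out`, in which each metric is written as a `name | value` line. The function name `star_log_summary` and the output labels are illustrative and are not part of STAR.

```shell
# Sketch: print the key alignment percentages from a STAR Log.final.out file.
# Assumes each metric line has the form "<metric name> | <value>".
star_log_summary() {
    # $1: path to a STAR Log.final.out file
    awk -F'|' '
        {
            key = $1; val = $2
            # Trim leading and trailing blanks/tabs from both fields
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "Uniquely mapped reads %"            { print "unique\t" val }
        key == "% of reads mapped to multiple loci" { print "multi\t" val }
        key == "% of reads mapped to too many loci" { print "too_many\t" val }
        key == "% of chimeric reads"                { print "chimeric\t" val }
    ' "$1"
}
```

Calling the function with the path to a sample's log prints one tab-separated line per metric, which makes it easy to compare the multi-mapped and chimeric percentages across samples.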
In addition to the statistics in the STAR log files, a variety of programs can assess
alignments in BAM files. Examples of those programs include Qualimap [16],
RNA-SeQC [17], and RSeQC [18]. Those programs compute metrics for RNA-Seq data,
including per-transcript coverage, junction sequence distribution, genomic localization of
reads, 5′–3′ bias, and consistency of the library protocol. As an example, you can use
Qualimap to obtain an overall view of the alignment quality in HTML format.
Download Qualimap from “http://qualimap.conesalab.org/” and unzip it in
your project directory. Run Qualimap for each sample and study the reports carefully. The
following script is an example of how to use it:
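# Create the output directory (no error if it already exists)
mkdir -p qc
# Run the Qualimap RNA-Seq QC on one BAM file
# (-a: counting algorithm; -p: sequencing library protocol)
qualimap_v2.2.1/qualimap rnaseq \
-outdir qc \
-a proportional \
-bam bam/norm_rep1.bam \
-p strand-specific-reverse \
-gtf gtf/hg38.ncbiRefSeq.gtf \
--java-mem-size=8G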
The above script creates the directory “qc”, where the Qualimap output files will be saved.
The program takes a BAM file and the reference annotation file as inputs and generates
an HTML report that includes summary statistics about read alignments, the genomic
origin of reads, the transcript coverage profile, and splice junction analysis, along with
figures showing read genomic origins, the coverage profile along genes, a coverage histogram,
and junction analysis. As a biologist, you may need to study these metrics to get a general
idea of each sample's alignment quality before proceeding.
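Because Qualimap is run once per sample, the same call can be wrapped in a loop over the BAM files. The following is a minimal sketch that reuses the options from the script above and assumes that all BAM files sit in one directory and share the same annotation file and library protocol. The function name `run_qualimap_all` and the `QUALIMAP` and `GTF` variables are illustrative, and each sample gets its own output subdirectory so that reports do not overwrite one another.

```shell
# Sketch: run "qualimap rnaseq" on every BAM file in a directory.
# QUALIMAP and GTF are illustrative variables; adjust them to your project layout.
QUALIMAP=${QUALIMAP:-qualimap_v2.2.1/qualimap}
GTF=${GTF:-gtf/hg38.ncbiRefSeq.gtf}

run_qualimap_all() {
    # $1: directory holding the BAM files; $2: root directory for the reports
    local bam_dir="$1" out_root="$2" bam sample
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue              # skip if no BAM file matches
        sample=$(basename "$bam" .bam)         # e.g. norm_rep1
        mkdir -p "$out_root/$sample"           # one report directory per sample
        "$QUALIMAP" rnaseq \
            -outdir "$out_root/$sample" \
            -a proportional \
            -bam "$bam" \
            -p strand-specific-reverse \
            -gtf "$GTF" \
            --java-mem-size=8G
    done
}
```

With the layout used in this section, `run_qualimap_all bam qc` would write each sample's report under “qc/<sample>”.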
5.3.4 Quantification
Gene profiling, or the study of gene expression, centers on quantifying the aligned
reads per gene or locus. Quantification begins by counting the number of reads
aligned to each gene annotated on the reference sequence. Given a BAM
file with aligned RNA-Seq reads and a list of genomic features in an annotation file (GTF
format), the task of the read counting program is to count the number of reads mapping
to each feature. In general, a feature in this case is a gene, which for eukaryotic
organisms is represented by a transcript or by the union of the gene's exons. Some programs
can also treat exons as features, which is especially useful for studying alternative
splicing in eukaryotic genes. A read in the BAM file may map to a single feature (unique) or may map or